We address the problem of structural disambiguation in syntactic parsing. In psycholinguistics, a
number of principles of disambiguation have been proposed, notably the Lexical Preference Rule 
(LPR), the Right Association Principle (RAP), and the Attach Low and Parallel Principle (ALPP). 
We argue that in order to improve disambiguation results it is necessary to implement these prin- 
ciples on the basis of a probabilistic methodology. We define a 'three-word probability' for im- 
plementing LPR, and a 'length probability' for implementing RAP and ALPP. Furthermore, we 
adopt the 'back-off' method to combine these two types of probabilities. Our experimental results
indicate our method to be effective, attaining an accuracy of 89.2%. 

1 Introduction 

Structural disambiguation is still a central problem in natural language processing. To completely
resolve ambiguities, we would need to construct a human-like language understanding
system (cf. [Altmann and Steedman 85, Johnson-Laird 83]). The construction of such a system
is extremely difficult, however, and we need to adopt a more realistic approach. In psycholinguistics,
a number of principles have been proposed which attempt to model the human disambiguation
process. The Lexical Preference Rule (LPR) [Ford et al. 82], the Right Association
Principle (RAP) [Kimball 73], and the Attach Low and Parallel Principle (ALPP, an extension
of RAP) [Hobbs and Bear 90] have been proposed, and it is thought that we might resolve ambiguities
quite satisfactorily if we could implement these principles sufficiently [Hobbs and Bear 90,
Whittemore et al. 90]. Methods of implementing these principles have also been proposed (e.g.,
[Shieber 83, Wermter 89, Wilks et al. 85]). An alternative approach is to view language as a
stochastic phenomenon, particularly from the viewpoint of information theory and statistics. If
we could properly define a probability model¹ and calculate the likelihood value of each interpretation
using the model, we might also resolve ambiguities quite well. There have been a number of
methods proposed to perform structural disambiguation using probability models, many of which
have proved to be quite effective [Alshawi and Carter 95, Black et al. 92, Briscoe and Carroll 93,
Chang et al. 92, Collins and Brooks 95, Fujisaki 89, Hindle and Rooth 91, Hindle and Rooth 93,
Jelinek et al. 90, Magerman and Marcus 91, Magerman 95, Ratnaparkhi et al. 94, Resnik 93,
Su and Chang 88].



1 A representation of a probability distribution is called a 'probability model,' or simply a 'model.'
Although each of the disambiguation methods proposed to date has its merits, none resolves
the disambiguation problem completely satisfactorily. We feel that it is necessary to devise a new 
method that unifies the above two approaches, i.e., to implement psycholinguistic principles of 
disambiguation on the basis of a probabilistic methodology. Most psycholinguistic principles have 
been developed on the basis of vast data of actual observations, and thus a method based on them 
is expected to achieve good disambiguation results. Probabilistic methods of implementing these 
principles have the merit of being able to handle noisy data, as well as being able to employ a 
principled methodology for acquiring the knowledge necessary for disambiguation. 

LPR, RAP and ALPP are known to be effective for disambiguation, and these are the ones 
whose implementation we consider in the present paper. Thus our problem involves the following 
three subproblems: (a) resolving structural ambiguities based on LPR in terms of probabilis- 
tic representations, (b) resolving structural ambiguities based on RAP and ALPP in terms of 
probabilistic representations, and (c) combining the two. For subproblem (a), we have devised 
a new method, based on LPR, which has some good properties not shared by the methods proposed
so far [Alshawi and Carter 95, Chang et al. 92, Collins and Brooks 95, Hindle and Rooth 91,
Ratnaparkhi et al. 94, Resnik 93]. In [Li and Abe 95], we have described this method in detail. In
the present paper, we mainly describe our solutions to subproblems (b) and (c). For subproblem 
(b), we point out that the notion of the 'length' of a syntactic category is important, and propose 
to use a 'length probability' to perform structural disambiguation. For subproblem (c), we propose 
to adopt the 'back-off' method, i.e., to make use first of a lexical likelihood based on LPR, and then 
a syntactic likelihood based on RAP and ALPP. Experiments conducted to test the effectiveness 
of our method demonstrate an encouraging accuracy of 89.2%. 



2 Psycholinguistic Principles of Disambiguation 

In this section, we introduce the psycholinguistic principles of disambiguation. Kimball has proposed
the Right Association Principle (RAP) [Kimball 73], which states that (in English) a phrase
on the right should be attached to the nearest phrase on the left if possible. Hobbs & Bear have
generalized RAP to the Attach Low and Parallel Principle (ALPP) [Hobbs and Bear 90]. ALPP
states that a phrase on the right should be attached to the nearest phrase on the left if possible,
and that phrases should be attached to a phrase in parallel if possible. (When we refer to ALPP,
we ordinarily mean just the part concerning attachments in parallel.) Ford et al. have proposed
the Lexical Preference Rule (LPR), which states that an interpretation is to be preferred whose case
frame assumes more semantically consistent values [Ford et al. 82]. Classically, lexical preference
is realized by checking consistencies between 'semantic features' of slots and those of slot values,
namely the 'selectional restrictions' [Katz and Fodor 63]. The realization of lexical preference in
terms of selectional restrictions has some disadvantages, however. Interpretations obtained in an
analysis cannot, for example, be ranked in their preferential order. Thus one cannot adopt a strat- 
egy of always retaining the N most plausible partial interpretations in an analysis, which is the most 
widely accepted practice at present. In fact it is more appropriate to treat the lexical preference as 
a kind of score representing the association between slots and their values. In the present paper, 
we refer to this kind of score as 'lexical preference.' For the same reason, we also treat 'syntactic 
preference' as a kind of score. 



2 The length of a syntactic category is simply defined as the number of words contained in that category.
LPR is a lexical semantic principle, while RAP and ALPP are syntactic ones, and in psycholinguistics
it is commonly claimed that LPR overrides RAP and ALPP [Hobbs and Bear 90]. Let us
consider some examples of LPR and RAP in this regard. For the sentence

I ate ice cream with a spoon, (1) 

there are two interpretations; one is 'I ate ice cream using a spoon' and the other 'I ate ice cream and 
a spoon.' In this sentence, a human speaker would certainly assume the former interpretation over 
the latter. From the psycholinguistic perspective, this can be explained in the following way: the 
former interpretation has a stronger lexical preference than the latter, and thus is to be preferred 
according to LPR. Moreover, since LPR overrides RAP, the preference is solely determined by LPR. 
For the sentence 

John phoned a man in Chicago, (2) 

there are two interpretations; one is 'John phoned a man who is in Chicago' and the other 'John, 
while in Chicago, phoned a man.' In this sentence, a human speaker would probably assume the 
former interpretation over the latter. The two interpretations have an equal lexical preference value, 
and thus the preference of the two cannot be determined by LPR. After LPR fails to work, the 
former interpretation is to be preferred according to RAP, because 'a man' is closer to 'in Chicago' 
than 'phoned' in the sentence.

LPR implies that (in natural language) one should communicate as relevantly as possible, while
RAP and ALPP imply that one should communicate as efficiently as possible. Although the
phenomena governed by these principles vary from language to language, the principles themselves,
we think, are language independent, and thus can be regarded as fundamental principles
of human communication. According to Whittemore et al. and Hobbs & Bear, nearly all of the
ambiguities can be resolved by first applying LPR and then RAP and ALPP [Hobbs and Bear 90,
Whittemore et al. 90]. These observations motivate us strongly to implement these principles for
disambiguation purposes.

While there are also other principles proposed in the literature, including the Minimal Attachment
Principle [Frazier and Fodor 79], they are generally either not highly functional or covered
by the above three principles in any case [Hobbs and Bear 90, Whittemore et al. 90].



The necessity of developing a disambiguation method with learning ability has recently come 
to be widely recognized. The realization of such a method would make it possible to (a) save the
cost of defining knowledge by hand, (b) do away with the subjectivity inherent in human definition,
and (c) make it easier to adapt a natural language analysis system to a new domain. We think that a
probabilistic approach is especially attractive because it is able to employ a principled methodology 
for acquiring the knowledge necessary for disambiguation. In our research, we implement LPR, RAP 
and ALPP by means of a probabilistic methodology. 

3 LPR and Lexical Likelihood 

In this section, we briefly describe our LPR-based probabilistic disambiguation method. 
3.1 The three-word probability

We refer to a syntactic tree and its corresponding case frame, as obtained in an analysis, as 'an
interpretation.'³ After analyzing the sentence in (1), for example, we obtain the case frames of the
interpretations:

eat:[arg1 I, arg2 ice_cream, with spoon], (3)

and

eat:[arg1 I, arg2 ice_cream:[with spoon]]. (4)

The value assumed by a case slot of a case frame of a verb can be viewed as being generated
according to a conditional probability distribution:

P(n | v, s), (5)

where random variable v takes on a value of a set of verbs, n a value of a set of nouns, and s
a value of a set of slot names. Similarly, the value assumed by a case slot of a case frame of a
noun can be viewed as being generated by a conditional probability distribution: P(n | n, s). We
call this kind of conditional probability the 'three-word probability.' Moreover, we assume that the
three-word probabilities in the case frame of an interpretation are mutually independent, and define
the geometric mean of the three-word probabilities as the 'lexical likelihood' of the interpretation:

P_{lex}(I) = ( \prod_{i=1}^{m} P_i )^{1/m}, (6)

where P_i is the ith three-word probability in the case frame of interpretation I, and m the number
of three-word probabilities in it. The lexical likelihood values of the two interpretations in (3) and
(4) are thus calculated as

P_{lex}(I_1) = (P(I | eat, arg1) × P(ice_cream | eat, arg2) × P(spoon | eat, with))^{1/3}, (7)

and

P_{lex}(I_2) = (P(I | eat, arg1) × P(ice_cream | eat, arg2) × P(spoon | ice_cream, with))^{1/3}. (8)

In disambiguation, we simply rank the interpretations according to their lexical likelihood values. If 
a verb (or a noun) has a strong tendency to require a certain noun as the value of its case frame slot, 
the estimated three-word probability for such a co-occurrence will be very high. To prefer an interpretation
with a higher lexical likelihood value, then, is to prefer it based on its lexical preference.
Specifically, in order to perform pp-attachment disambiguation in analysis of sentences like (1), we
need only calculate and compare the values of P(spoon|eat, with) and P(spoon|ice_cream, with). In 
sentences like 

A number of companies sell and buy by computer, (9) 

the number of three-word probabilities in each of its respective interpretations will be different. 
If we were to define a lexical likelihood as the product of the three-word probabilities in the case 
frame of an interpretation, an interpretation with fewer case slots would be preferred. We use the 
definition of lexical likelihood described above to avoid this problem.⁴
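
The geometric-mean definition in (6) can be illustrated with a minimal Python sketch. The probability values below are invented for illustration and are not estimates from the paper; the function and table names are likewise hypothetical.

```python
import math

# Hypothetical three-word probabilities P(noun | head, slot).
# These values are illustrative assumptions, not estimates from the paper.
THREE_WORD_PROB = {
    ("I", "eat", "arg1"): 0.02,
    ("ice_cream", "eat", "arg2"): 0.01,
    ("spoon", "eat", "with"): 0.05,
    ("spoon", "ice_cream", "with"): 0.001,
}


def lexical_likelihood(case_slots, probs=THREE_WORD_PROB):
    """Geometric mean of the three-word probabilities of one interpretation."""
    values = [probs.get(slot, 0.0) for slot in case_slots]
    if not values or any(v == 0.0 for v in values):
        # An unseen triple makes the whole likelihood 0 (no smoothing here).
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Case frame (3): the PP 'with spoon' attaches to the verb 'eat'.
verb_attach = [("I", "eat", "arg1"), ("ice_cream", "eat", "arg2"),
               ("spoon", "eat", "with")]
# Case frame (4): the PP 'with spoon' attaches to the noun 'ice_cream'.
noun_attach = [("I", "eat", "arg1"), ("ice_cream", "eat", "arg2"),
               ("spoon", "ice_cream", "with")]

if __name__ == "__main__":
    print(lexical_likelihood(verb_attach), lexical_likelihood(noun_attach))
```

Because the two case frames differ only in the last triple, the ranking reduces to comparing P(spoon | eat, with) with P(spoon | ice_cream, with), as noted in the text.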

3 We do not take into account ambiguities caused by word senses. 

4 An alternative for resolving this kind of ambiguity (coordinate structure ambiguity) is to employ a method which
examines the similarity that exists between conjuncts (cf. [Kurohashi and Nagao 94, Resnik 93]).
3.2 The data sparseness problem

Hindle & Rooth have previously proposed resolving pp-attachment ambiguities with 'two-word
probabilities' [Hindle and Rooth 91], e.g., P(with | ice_cream), P(with | eat), but these are not accurate
enough to represent lexical preference. For example, in the sentences,

Britain reopened the embassy in December,
Britain reopened the embassy in Teheran, (10)

the pp-attachment sites of the two prepositional phrases are different. The attachment sites would
be determined to be the same, however, if we were to use two-word probabilities (cf. [Resnik 93]),
and thus the ambiguity of only one of the sentences can be resolved. It is very likely, however, that
this kind of ambiguity could be resolved satisfactorily by using the three-word probabilities.

The number of parameters that need to be estimated increases drastically when we use three- 
word probabilities, and the data available for estimation of the probability parameters usually are 
not sufficient in practice. If we employ the Maximum Likelihood Estimator, we may find most of 
the parameters are estimated to be 0: a problem often referred to, in statistical natural language 
processing, as the 'data sparseness problem.' (The motivation for using the two-word probabilities
in [Hindle and Rooth 91] appears to be a desire to avoid the data sparseness problem.) One may
expect this problem to be less severe in the future, when more data are available. However, 
as data size increases, new words may appear, and the number of parameters that need to be 
estimated may increase as well. Thus, the data sparseness problem is unlikely to be resolved. 
A number of methods have been proposed, however, to cope with the data sparseness problem. 
Chang et al., for instance, have proposed replacing words with word classes and using class-based
co-occurrence probabilities [Chang et al. 92]. However, forcibly replacing words with certain word
classes is too loose an approximation, which, in practice, could seriously degrade disambiguation
results. Resnik has defined a probabilistic measure called 'selectional association' in terms of the 
word classes existing in a given thesaurus. While Resnik's method is based on an interesting 
intuition, the justification of this method from the viewpoint of statistics is still not clear. We 
have devised a method of estimating the three-word probabilities in an efficient and theoretically
sound way [Li and Abe 95]. Our method selects optimal word classes according to the distribution
of given data, and smoothes the three-word probabilities using the selected classes. Experimental 
results indicate that our method improves upon or is at least as effective as existing methods. 
Using our method of estimating (smoothing) probabilities, we can cope with the data sparseness 
problem. However, for the same reason as described above, the data sparseness problem cannot be 
resolved completely. We propose combining the use of three-word probabilities and that of two- 
word probabilities. Specifically, we first use the lexical likelihood value calculated as the geometric 
mean of the three-word probabilities of an interpretation; and when the lexical likelihood values of 
obtained interpretations are equal, including the case in which all of them are 0, we use the lexical 
likelihood value calculated as the geometric mean of the two-word probabilities of an interpretation.



4 RAP, ALPP, and Syntactic Likelihood

In this section, we describe our probabilistic disambiguation method based on RAP and ALPP. 






4.1 The deterministic approach 



Shieber has previously proposed incorporating RAP into the mechanism of a shift-reduce parser
[Shieber 83]. When RAP is implemented, the parser prefers shift to reduce whenever a 'shift-reduce
conflict' occurs. The advantage of this deterministic approach is its simple mechanism,
while the disadvantage is that although it can output the most preferred interpretation, it cannot 
rank interpretations in their preferential order. In order to be able to rank interpretations in this 
way, it is necessary to construct a parser which operates stochastically, not deterministically. 

4.2 Formalizing a syntactic preference 

In this subsection, we formalize a syntactic preference based on RAP and ALPP. While we borrow
from the terminology of HPSG [Pollard and Sag 87] in our reference to 'head' categories, we also
use the term 'modifier' categories to refer to categories which HPSG would classify as being either
'complements' or 'adjuncts.' We refer to the word which exhibits the subcategory feature of a
category as that category's 'head word.'

Figure 1: RAP, ALPP and length

Let us consider a simple case in which we are dealing with a modifier category M, a head
category H, and the head word of H, w. We first apply CFG rule L → H, M to H and M, yielding
category L (see Figure 1(a)). We refer to the number of words in a given sequence as 'distance.'
As may be seen in Figure 1(a), the distance between M and w is d. RAP prefers an interpretation
with a smaller d. Thus, syntactic preference can be represented by a monotonically decreasing
function of d. Since in English the head word w of category H tends to be located near its left corner,
we can approximate d as l, the number of words contained in H. In this paper, we call the number
of words contained in a category the 'length' of that category. In addition, syntactic preference
also depends on the types of the head category and the modifier category. Assume that l is known to be 5; if
H is a verb phrase and M is a prepositional phrase, the preference value is likely to be high, but if
H is a noun phrase and M is a prepositional phrase, it is likely to be low. Since category type can
be specified within a CFG rule, syntactic preference can be defined as a function of a CFG rule.
Syntactic preference based on RAP can be formalized, then, as a function of CFG rule L → H, M
and length l, namely,

S(l, (L → H, M)). (11)

Suppose that categories R_1 and R_2 form a coordinate structure, and l_1 and l_2 are the lengths
of R_1 and R_2, respectively. ALPP prefers categories forming a coordinate structure to be of equal
length (see Figure 1(b)). The preference value will be high when l_1 equals l_2, and syntactic preference
based on ALPP⁵ can be defined as

S(l_1, l_2, (L → R_1, C, R_2)). (12)

Further, suppose that categories R_1, R_2, ..., R_k are combined into category L, and l_1, l_2, ..., l_k
are the lengths of R_1, R_2, ..., R_k, respectively. Syntactic preference of the attachment can then be
defined as

S(l_1, l_2, ..., l_k, (L → R_1, R_2, ..., R_k)). (13)



Note that (13) contains (11) and (12). Furthermore, we assume that the attachments in the
syntactic tree of an interpretation are mutually independent, and we define the product (or the
sum, depending on the preference function) of the syntactic preference values of the attachments
in the syntactic tree of the interpretation as the syntactic preference of the interpretation:

S_{syn}(I) = \prod_{i=1}^{m} S_i, (14)

where S_i denotes the syntactic preference value of the ith attachment in the syntactic tree of
interpretation I, and m the number of attachments in it.



4.3 The length probability 

We now consider how to specify the syntactic preference function in (13). As there are any number
of ways to formulate the function (note the fact that syntactic preference is also a function of a
CFG rule), it is nearly impossible to find the most suitable formula experimentally. To cope with
this problem, we used machine learning techniques (recall the merits of using machine learning
techniques in disambiguation, as described in Section 2). Specifically, we have defined a probability
model to calculate syntactic preference. Suppose that attachments represented by CFG rules and
lengths are extracted from the correct syntactic trees in training data, and the frequency of each
kind of attachment is obtained as

f(l_1, l_2, ..., l_k, (L → R_1, R_2, ..., R_k)), (15)

where L → R_1, R_2, ..., R_k denotes a CFG rule, and l_1, l_2, ..., l_k denote the lengths of R_1, R_2, ..., R_k,
respectively. RAP prefers an interpretation attached to a nearer phrase, while ALPP prefers interpretations
with attachments that are low and in parallel. Many such attachments may be observed
in the training data, and we can formulate the frequencies of attachments (15) as a syntactic preference.
Considering the fact that individual rules will be applied with different frequency, it is
desirable to modify the syntactic preference to

\frac{f(l_1, l_2, ..., l_k, (L → R_1, R_2, ..., R_k))}{f((L → R_1, R_2, ..., R_k))}, (16)

where f((L → R_1, R_2, ..., R_k)) denotes the frequency of application of CFG rule L → R_1, R_2, ..., R_k.
This is precisely the 'length probability' we propose in this paper.
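
As a minimal sketch of the relative-frequency ratio above, the following Python function counts attachments per rule and divides by the rule frequency. The data layout (a rule as a pair of left-hand side and right-hand-side tuple) and the toy counts are assumptions made for illustration only.

```python
from collections import Counter


def estimate_length_probs(attachments):
    """Relative-frequency estimate of length probabilities.

    `attachments` is an iterable of (rule, lengths) pairs taken from the
    correct trees, where `rule` is (lhs, rhs_tuple) and `lengths` gives the
    number of words under each right-hand-side category.
    Returns a dict mapping (rule, lengths) to f(lengths, rule) / f(rule).
    """
    joint = Counter()       # f(l_1, ..., l_k, rule)
    rule_count = Counter()  # f(rule)
    for rule, lengths in attachments:
        joint[(rule, tuple(lengths))] += 1
        rule_count[rule] += 1
    return {key: n / rule_count[key[0]] for key, n in joint.items()}


# Toy training data (hypothetical): four applications of NP -> NP, PP.
NP_PP = ("NP", ("NP", "PP"))
toy_attachments = [
    (NP_PP, (2, 3)),
    (NP_PP, (2, 3)),
    (NP_PP, (2, 3)),
    (NP_PP, (5, 3)),
]
length_probs = estimate_length_probs(toy_attachments)

if __name__ == "__main__":
    for key, p in sorted(length_probs.items()):
        print(key, p)
```

With these toy counts, the short-head attachment (2, 3) receives three quarters of the probability mass for the rule, which is the kind of skew toward nearby attachments that RAP predicts.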



5 This kind of syntactic preference requires that the CFG rules for coordinate structures have the form
L → R_1, C, R_2, C, ..., C, R_k.
Let us now define the length probability more formally. Suppose that an attachment is obtained
after the application of CFG rule L → R_1, R_2, ..., R_k, and that the lengths of R_1, R_2, ..., R_k are l_1, l_2, ..., l_k,
respectively. The attachment can be viewed as being generated by the following conditional distribution:

P(l_1, l_2, ..., l_k | (L → R_1, R_2, ..., R_k)). (17)

We call this kind of conditional probability the 'length probability.'⁶ Furthermore, the syntactic
likelihood of an interpretation is defined as the geometric mean of the length probabilities of the
attachments in the syntactic tree of the interpretation, assuming that the attachments are mutually
independent:

P_{syn}(I) = ( \prod_{i=1}^{m} P_i )^{1/m}, (18)

where P_i is the ith length probability in the syntactic tree of interpretation I, and m the number
of length probabilities in it. We define syntactic likelihood as the geometric mean of the length
probabilities, rather than as the product of the length probabilities, in order to factor out the effect
of the different number of attachments in the syntactic trees of individual interpretations. When
training the length probabilities, the parameters in (17) may be estimated using the frequencies in
(16).
Next, let us consider a simple example illustrating how the operation of this model indicates
the functioning of RAP. For the phrase shown in Figure 2(a), there are two interpretations; RAP
would necessarily prefer the former. The difference between the syntactic likelihood values of the
two interpretations is solely determined by

P(1, 5 | (PP → P, NP)) × P(2, 6 | (NP → NP, PP)), (19)

and

P(1, 2 | (PP → P, NP)) × P(5, 3 | (NP → NP, PP)). (20)

First, let us compare the left-hand length probabilities of (19) and (20). Both represent an attachment
of NP to P, and the length of P is 1 in both terms. Thus the two estimated probabilities may
not differ so greatly. Next, compare the right-hand length probabilities in (19) and (20). While
both represent an attachment of PP to NP, the length of NP of the former is 2 and that of the
latter is 5. Thus the second length probability in (19) is likely to be higher than that in (20), as in
training data there are more phrases attached to nearby phrases than are attached to distant ones.
Therefore, when we use only the syntactic likelihood to perform disambiguation, we can expect the
former interpretation in Figure 2(a) to be preferred, i.e., we have an indication of the functioning
of RAP.
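
The comparison of (19) and (20) can be worked through numerically. The sketch below uses invented length probabilities, chosen only so that the two P-to-NP terms are close while the short-head NP attachment is more probable than the long-head one; only the attachments that differ between the two interpretations are listed, since the shared ones contribute equally to both.

```python
import math

PP_RULE = ("PP", ("P", "NP"))
NP_RULE = ("NP", ("NP", "PP"))

# Hypothetical length probabilities P(lengths | rule); illustrative values only.
LENGTH_PROB = {
    (PP_RULE, (1, 5)): 0.10,
    (PP_RULE, (1, 2)): 0.12,
    (NP_RULE, (2, 6)): 0.20,
    (NP_RULE, (5, 3)): 0.05,
}


def syntactic_likelihood(attachments, probs=LENGTH_PROB):
    """Geometric mean of the length probabilities of the given attachments."""
    values = [probs.get(a, 0.0) for a in attachments]
    if not values or min(values) == 0.0:
        return 0.0
    return math.prod(values) ** (1.0 / len(values))


# Terms of (19): 'in the room' attaches low, to 'the table'.
low_attachment = [(PP_RULE, (1, 5)), (NP_RULE, (2, 6))]
# Terms of (20): 'in the room' attaches high, to 'the block on the table'.
high_attachment = [(PP_RULE, (1, 2)), (NP_RULE, (5, 3))]

if __name__ == "__main__":
    print(syntactic_likelihood(low_attachment),
          syntactic_likelihood(high_attachment))
```

Under these assumed values the low attachment wins even though its P-to-NP term is slightly smaller, because the NP-to-PP terms differ by a larger factor.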

Let us consider another example illustrating how the operation of the length probability model
indicates the functioning of ALPP. For the sentence shown in Figure 2(b), there are two interpretations;
ALPP would necessarily prefer the former. The difference between the syntactic likelihood
values of the two interpretations is solely determined by

P(3, 2 | (VP → VP, PP)) × P(1, 1, 1 | (VP → VP, C, VP)), (21)



6 The number of parameters in a length probability model depends on k, the number of categories on the right-
hand side of a CFG rule, and N, the maximum value of lengths of a category on the left-hand side of the rule:
\sum_{i=k}^{N} \binom{i-1}{k-1} = \binom{N}{k}. Since k is very small (in our case k ≤ 3), the number of parameters in a length
probability model is of N's polynomial order.
and

P(1, 2 | (VP → VP, PP)) × P(1, 1, 3 | (VP → VP, C, VP)). (22)



First, let us compare the left-hand length probabilities in (21) and (22). Both represent an attachment
of PP to VP, but the length of VP of the former is 3 and that of the latter is 1. The
left-hand probability in (21) is likely to be lower than that in (22). Next, compare the right-hand
length probabilities in (21) and (22). Both represent a coordinate structure consisting of VPs. The
lengths of VPs in the former are equal, while the lengths of VPs in the latter are not. Thus the
right-hand probability in (21) is likely to be higher than that in (22). Moreover, the difference
between the right-hand probabilities is likely to be larger than that between the left-hand probabilities,
and thus the syntactic likelihood value of the former interpretation will be higher than that
of the latter. Therefore, when we use only the syntactic likelihood to perform disambiguation, we
can expect the former interpretation in Figure 2(b) to be preferred.



(a) the block on the table in the room
[NP:8 [NP:2 the block] [PP:6 [P:1 on] [NP:5 [NP:2 the table] [PP:3 in the room]]]]
[NP:8 [NP:5 [NP:2 the block] [PP:3 [P:1 on] [NP:2 the table]]] [PP:3 in the room]]

(b) A number of companies sell and buy by computer
[VP:5 [VP:3 [VP:1 sell] [C:1 and] [VP:1 buy]] [PP:2 by computer]]
[VP:5 [VP:1 sell] [C:1 and] [VP:3 [VP:1 buy] [PP:2 by computer]]]

Non-terminal:length

Figure 2: Examples of syntactic parsing



4.4 The syntactic parsing approach 



Another approach to disambiguation is to define a probability model on the basis of syntactic
parsing. One method of this type employs the well-known PCFG (Probabilistic Context Free
Grammar) model [Fujisaki 89, Jelinek et al. 90, Lari and Young 90]. In PCFG, a CFG rule having
the form of α → β is associated with a conditional probability P(β | α), and the likelihood of
a syntactic tree is defined as the product of the conditional probabilities of the rules which are
applied in the derivation of that tree. Other methods have also been proposed. Magerman &
Marcus, for instance, have proposed making use of a conditional probability model specifying
a conditional probability of a CFG rule, given the part-of-speech trigram it dominates and its
parent rule [Magerman and Marcus 91]. Black et al. have defined a richer model to utilize all the
information in the top-down derivation of a non-terminal [Black et al. 92]. Briscoe & Carroll have
proposed using a probabilistic model specific to LR parsing [Briscoe and Carroll 93].

The advantage of the syntactic parsing approach is that it may embody heuristics (principles) 
effective in disambiguation, which would not have been thought of by humans, but it also risks not 
embodying heuristics (principles) already known to be effective in disambiguation. For example, 
the two interpretations of the noun phrase shown in Figure 2(a) have an equal likelihood value, if
we employ PCFG, although the former would be preferred according to RAP. 
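
The reason PCFG cannot separate the two interpretations is that both trees apply the same multiset of rules (NP → NP, PP twice and PP → P, NP twice). A minimal Python sketch makes this concrete; the rule probabilities are invented, the tree encoding is an assumption for illustration, and lexical material is ignored.

```python
import math

# Hypothetical PCFG rule probabilities P(rhs | lhs); illustrative values only.
RULE_PROB = {
    ("NP", ("NP", "PP")): 0.3,
    ("PP", ("P", "NP")): 0.9,
}


def pcfg_likelihood(tree, rule_prob=RULE_PROB):
    """Product of the probabilities of the rules applied in `tree`.

    A tree is a tuple (label, child, ...). A node whose children are all
    strings is treated as lexical material and contributes a factor of 1.
    """
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        return 1.0
    rhs = tuple(child[0] for child in children)
    p = rule_prob[(label, rhs)]
    for child in children:
        p *= pcfg_likelihood(child, rule_prob)
    return p


# Low attachment: 'in the room' modifies 'the table'.
low_tree = ("NP", ("NP", "the block"),
            ("PP", ("P", "on"),
             ("NP", ("NP", "the table"),
              ("PP", ("P", "in"), ("NP", "the room")))))
# High attachment: 'in the room' modifies 'the block on the table'.
high_tree = ("NP",
             ("NP", ("NP", "the block"),
              ("PP", ("P", "on"), ("NP", "the table"))),
             ("PP", ("P", "in"), ("NP", "the room")))

if __name__ == "__main__":
    print(pcfg_likelihood(low_tree), pcfg_likelihood(high_tree))
    print(math.isclose(pcfg_likelihood(low_tree), pcfg_likelihood(high_tree)))
```

Whatever values are assigned to the two rules, the products coincide, so a tie-breaking preference such as RAP has to come from information outside the rule probabilities, for example the lengths.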



5 The Back-Off Method 

Having defined a lexical likelihood based on LPR and a syntactic likelihood based on RAP and
ALPP, we may next consider how to combine the two kinds of likelihood in disambiguation. One
choice is to calculate total preference as a weighted average of likelihood values, as proposed in
[Alshawi and Carter 95]. However, since LPR overrides RAP and ALPP, a simpler approach is to
adopt the back-off method, i.e., to rank interpretations I_1 and I_2 as follows:

1. if P_{lex}(I_1) - P_{lex}(I_2) > η then I_1 > I_2
2. else if P_{lex}(I_2) - P_{lex}(I_1) > η then I_2 > I_1
3. else if P_{syn}(I_1) - P_{syn}(I_2) > τ then I_1 > I_2      (23)
4. else if P_{syn}(I_2) - P_{syn}(I_1) > τ then I_2 > I_1

where I_1 and I_2 denote any two interpretations, P_{lex}() denotes the lexical likelihood of an interpretation,
and P_{syn}() the syntactic likelihood of an interpretation. η ≥ 0 and τ ≥ 0 are thresholds (in
the experiment described later, both are set to 0). Note that in lines 3 and 4, |P_{lex}(I_1) - P_{lex}(I_2)| ≤ η
holds. Further note that the preferential order cannot be determined (or can only be determined
at random) when |P_{lex}(I_1) - P_{lex}(I_2)| ≤ η and |P_{syn}(I_1) - P_{syn}(I_2)| ≤ τ.
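
The four ranking rules can be sketched as a small comparison function in Python. The function name, the return convention (1, 2, or 0 for undetermined), and the example values are assumptions made for illustration; the thresholds default to 0, as in the experiments.

```python
def backoff_compare(lex1, lex2, syn1, syn2, eta=0.0, tau=0.0):
    """Rank two interpretations by lexical likelihood, then syntactic likelihood.

    Returns 1 if the first interpretation is preferred, 2 if the second is
    preferred, and 0 if neither likelihood separates them by more than its
    threshold (eta for lexical, tau for syntactic).
    """
    if lex1 - lex2 > eta:
        return 1
    if lex2 - lex1 > eta:
        return 2
    # Lexical likelihoods are within eta of each other: back off to syntax.
    if syn1 - syn2 > tau:
        return 1
    if syn2 - syn1 > tau:
        return 2
    return 0


if __name__ == "__main__":
    # Lexical likelihood decides; the syntactic values are never consulted.
    print(backoff_compare(0.02, 0.005, 0.1, 0.3))
    # Lexical likelihoods tie (both 0), so syntactic likelihood decides.
    print(backoff_compare(0.0, 0.0, 0.1, 0.3))
```

The ordering of the tests encodes the claim that LPR overrides RAP and ALPP: the syntactic likelihood is consulted only when the lexical likelihoods fail to discriminate.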

6 Experimental Results 

We have conducted experiments to test the effectiveness of our proposed method. This section 
describes the results. In the experiments, we considered only resolving pp-attachment ambiguities
and coordinate structure ambiguities. These two kinds of ambiguities are typical, and other
ambiguities can be resolved in the same way [Hobbs and Bear 90].



We first defined 12 CFG rules as our grammar to be used by a parser which calculates a preference
for each partial interpretation, and always retains the N most preferable partial interpretations.⁷
We have not yet actually constructed such a parser, however, and use a parser called 'SAX,'
previously developed by Matsumoto & Sugimura [Matsumoto and Sugimura 8q], which calculates
a preference for each interpretation after it obtains all the interpretations.

We then trained the parameters of probability models. We extracted 181,250 case frames from
the WSJ (Wall Street Journal) bracketed corpus of the Penn Tree Bank [Marcus et al. 93]. We
used these data to estimate three-word probabilities and two-word probabilities. Furthermore, 
we extracted 963 sentences from the WSJ tagged corpus of the Penn Tree Bank. We used SAX 
to analyze the sentences and selected the correct syntactic trees by hand. We then employed 



7 It is necessary to do so, as the number of ambiguities will increase drastically when the length of an input sentence
increases [Church and Patil 82].
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the Maximum Likelihood Estimator to estimate length probabilities using the selected syntactic 
trees, e.g., if CFG rule NP — * NP, PP is applied x times, and among the attachments obtained by 
applying this rule, Xi of them have the lengths of 2 and 3, then the length probability P(2, 3\(NP — > 
NP,PP)) is estimated as ^£ It is known, in statistics, that the number of samples required for 
accurate estimation of a probabilistic model is roughly proportional to the number of parameters 
in the target model, and thus the data used for training length probabilities were nearly sufficient. 
Figure |3] plots the estimated length probabilities versus the lengths, for two CFG rules. The result 
indicates that there are more attachments attached to nearby phrases than are attached to distant 
ones in the training data. Moreover, the length probabilities for CFG rule VP — > VP, PP and 
those for CFG rule NP — > NP, PP show different distribution patterns, suggesting that syntactic 
preference is a function of a CFG rule. 
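
As a minimal sketch of the relative-frequency (maximum likelihood) estimate described above, the
following Python function counts how often each pair of lengths occurs for each CFG rule and
divides by the number of times the rule is applied. The input format (a list of (rule, lengths)
pairs), the function name, and the toy data are assumptions made for illustration; they are not
the data format used in the experiments.

```python
from collections import Counter, defaultdict


def estimate_length_probabilities(observations):
    """Relative-frequency estimate of P(lengths | rule).

    `observations` is a hypothetical input format: one (rule, lengths)
    pair per rule application found in the hand-selected syntactic trees.
    """
    rule_counts = Counter()             # x: how often each rule is applied
    pair_counts = defaultdict(Counter)  # x_1: how often each length pair occurs
    for rule, lengths in observations:
        rule_counts[rule] += 1
        pair_counts[rule][lengths] += 1
    return {
        rule: {lengths: c / rule_counts[rule] for lengths, c in counts.items()}
        for rule, counts in pair_counts.items()
    }


# Toy data (invented): the rule is applied x = 4 times,
# and x_1 = 3 of the attachments have the lengths 2 and 3.
toy = [("NP -> NP PP", (2, 3))] * 3 + [("NP -> NP PP", (2, 5))]
probs = estimate_length_probabilities(toy)
print(probs["NP -> NP PP"][(2, 3)])  # 0.75
```

With the toy counts, the estimate for lengths (2, 3) is 3/4, matching the $x_1 / x$ form of the
estimator.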




Figure 3: Length probability versus length, panels (a) and (b)



We then extracted 249 sentences, as our test data, from a part of the tagged WSJ corpus which was
not used in training, and analyzed the sentences. When analyzing a sentence, we rank the
obtained interpretations as follows:

if $P_{lex3}(I_1) > P_{lex3}(I_2)$ then $I_1 > I_2$

else if $P_{lex3}(I_2) > P_{lex3}(I_1)$ then $I_2 > I_1$

else if $P_{lex2}(I_1) > P_{lex2}(I_2)$ then $I_1 > I_2$

else if $P_{lex2}(I_2) > P_{lex2}(I_1)$ then $I_2 > I_1$

else if $P_{syn}(I_1) > P_{syn}(I_2)$ then $I_1 > I_2$

else if $P_{syn}(I_2) > P_{syn}(I_1)$ then $I_2 > I_1$

where $I_1$ and $I_2$ denote any two interpretations. $P_{lex3}()$ denotes the lexical likelihood
value of an interpretation calculated as the geometric mean of three-word probabilities,
$P_{lex2}()$ the lexical likelihood value of an interpretation calculated as the geometric mean of
two-word probabilities, and $P_{syn}()$ the syntactic likelihood value of an interpretation. The
average number of interpretations obtained in the analysis of a sentence was 2.4.
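
The ranking rule above can be sketched as a pairwise comparison that consults the three likelihood
values in turn. This is an illustrative sketch only: the dictionary keys 'lex3', 'lex2' and 'syn',
the function names, and the toy values are assumptions, and the helper for the geometric mean
simply follows the definition of the lexical likelihood given in the text.

```python
import math
from functools import cmp_to_key


def geometric_mean(probabilities):
    """Geometric mean of a non-empty list of probabilities (0 if any is 0)."""
    return math.prod(probabilities) ** (1.0 / len(probabilities))


def backoff_compare(a, b):
    """Return -1 if a is preferred, 1 if b is preferred, 0 if tied.

    Each interpretation is a dict with the hypothetical keys 'lex3',
    'lex2' and 'syn'. A later value is consulted only when all earlier
    values are equal for the two interpretations.
    """
    for key in ("lex3", "lex2", "syn"):
        if a[key] > b[key]:
            return -1
        if b[key] > a[key]:
            return 1
    return 0


def rank(interpretations):
    """Order interpretations from most to least preferred."""
    return sorted(interpretations, key=cmp_to_key(backoff_compare))


# Toy values (invented for illustration).
i1 = {"name": "I1", "lex3": 0.0, "lex2": 0.0, "syn": 0.2}
i2 = {"name": "I2", "lex3": 0.0, "lex2": 0.0, "syn": 0.3}
i3 = {"name": "I3", "lex3": 0.01, "lex2": 0.0, "syn": 0.1}
print([i["name"] for i in rank([i1, i2, i3])])  # ['I3', 'I2', 'I1']
```

In the toy example, I3 wins on the three-word lexical likelihood alone, while I1 and I2 are tied on
both lexical values and are separated only by the syntactic likelihood.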

The number 1 accuracy obtained was 89.2% (Table 1 represents this result as 'Lex3+Lex2+Syn'),
where the number n accuracy is defined as the fraction of the test sentences whose preferred
interpretation is successfully ranked in the first n candidates. We feel that this result is very
encouraging. Table 2 shows the breakdown of the result, in which 'Lex3' stands for the proportion
determined by using lexical likelihood $P_{lex3}$, 'Lex2' by using lexical likelihood $P_{lex2}$,
and 'Syn' by using syntactic likelihood $P_{syn}$. The accuracies of 'Lex3,' 'Lex2,' and 'Syn' were
95.7%, 87.0%, and 66.7%, respectively. Furthermore, 'Lex3,' 'Lex2,' and 'Syn' formed 47.0%, 43.4%,
and 9.6% of the disambiguation results, respectively.

Table 1: Disambiguation results

Method            Accuracy(%)
Lex3+Lex2+Syn     89.2
Lex3+Lex2+PCFG    86.7
Lex3(Lex2)xSyn    87.1

Table 2: Breakdown of 'Lex3+Lex2+Syn'

         Correct   Incorrect   Total
Lex3     112       5           117
Lex2     94        14          108
Syn      16        8           24
Total    222       27          249
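
The percentages quoted for the breakdown follow directly from the counts in Table 2. The short
Python check below recomputes them; the variable names are chosen here for illustration, and the
counts are copied from the table.

```python
# Counts taken from Table 2: (correct, incorrect) decisions per stage.
breakdown = {"Lex3": (112, 5), "Lex2": (94, 14), "Syn": (16, 8)}

total = sum(c + i for c, i in breakdown.values())      # all test sentences
total_correct = sum(c for c, _ in breakdown.values())  # correctly disambiguated

for stage, (correct, incorrect) in breakdown.items():
    decided = correct + incorrect
    accuracy = 100 * correct / decided  # accuracy within this stage
    share = 100 * decided / total       # share of all disambiguation results
    print(f"{stage}: accuracy {accuracy:.1f}%, share {share:.1f}%")

print(f"Overall: {100 * total_correct / total:.1f}%")
# Lex3: accuracy 95.7%, share 47.0%
# Lex2: accuracy 87.0%, share 43.4%
# Syn: accuracy 66.7%, share 9.6%
# Overall: 89.2%
```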

We further examined the types of mistakes made by our method. First, there were some mistakes
by 'Syn.' For example, in

Rain washes the fertilizers off the land. (25)

there are two interpretations. The lexical likelihood values $P_{lex3}$ of the two interpretations
were calculated as 0, and the lexical likelihood values $P_{lex2}$ of the two interpretations were
calculated as 0, as well. The interpretations were therefore ranked by the syntactic likelihood
$P_{syn}$, and the interpretation of attaching the 'off' phrase to 'fertilizer' was mistakenly
preferred. We also found some mistakes by 'Lex2.' For example, in

The parents reclaimed the child under the circumstances. (26)

there are two interpretations. The lexical likelihood values $P_{lex3}$ of the two interpretations
were calculated as 0. The lexical likelihood value $P_{lex2}$ of the interpretation of attaching
the 'under' phrase to 'child' was higher than that of attaching it to 'reclaim,' as there were many
expressions like 'a child under five' observed in the training data. Thus the former interpretation
was mistakenly preferred. Mistakes of this kind could be avoided if more data were available.
We conclude that the most effective way of improving disambiguation results is to increase the
data for training lexical preference.

Figure 4: The top 5 accuracies

We further checked the disambiguation decisions made by 'Syn' when 'Lex3' and 'Lex2' fail to
work, and found that all of the prepositional phrases in these sentences were attached to nearby
phrases by 'Syn,' indicating that the use of syntactic likelihood helps to realize the effect of
RAP. One may argue that we could obtain the same number 1 accuracy if we were to employ
a deterministic approach in implementing RAP. As we pointed out earlier, however, if we are to
obtain the N most preferred interpretations, we need to use syntactic likelihood. To verify that
the syntactic likelihood is indeed useful, we conducted the following additional experiment. For
the stochastic approach, we ranked the interpretations of each of the 249 test sentences using only
syntactic likelihood. For the deterministic approach, we selected the interpretation in which
phrases are always attached to nearby phrases as the most preferred one, and randomly selected
interpretations from those remaining as the nth most preferred ones. We evaluated the results on
the basis of the number n accuracy. Figure 4 shows the top 5 accuracies of the stochastic approach
and the deterministic approach. The results indicate that the former outperforms the latter,
although the improvement is not significant. (The number 2 accuracy for both methods increases
drastically, as many test sentences have only two interpretations.) We expect the effect of the use
of the syntactic likelihood to become more significant when longer sentences are used in future
analyses.
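
The evaluation measure and the deterministic baseline described above can be sketched as follows.
This is a hedged illustration rather than the evaluation code used in the experiments: the data
layout (one ranked list of interpretation labels per sentence), the function names, and the toy
data are all assumptions.

```python
import random


def number_n_accuracy(ranked_lists, correct, n):
    """Fraction of sentences whose correct interpretation is in the first n.

    ranked_lists[k] is the ranked list of interpretations for sentence k
    and correct[k] is its hand-selected correct interpretation
    (a hypothetical data layout).
    """
    hits = sum(1 for ranking, gold in zip(ranked_lists, correct) if gold in ranking[:n])
    return hits / len(ranked_lists)


def deterministic_ranking(interpretations, nearest, rng):
    """Put the nearest-attachment interpretation first, the rest in random order."""
    rest = [i for i in interpretations if i != nearest]
    rng.shuffle(rest)
    return [nearest] + rest


# Toy data (invented): three sentences with two or three interpretations each.
rankings = [["a", "b"], ["b", "a"], ["c", "a", "b"]]
gold = ["a", "a", "b"]
for n in (1, 2, 3):
    print(n, number_n_accuracy(rankings, gold, n))
```

Because two of the three toy sentences have only two interpretations, the number 2 accuracy is
already much higher than the number 1 accuracy, which mirrors the remark in the text about
sentences with only two interpretations.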

In a further experiment, we used a PCFG in place of the length probability model for calculating
syntactic preference. We employed the Maximum Likelihood Estimator to estimate the parameters
of the PCFG (we did not use the so-called 'inside-outside algorithm' [Jelinek et al. 90, Lari and
Young 90]), making use of the same training data as those used for the length probability model.
Table 1 represents this result as 'Lex3+Lex2+PCFG.' Our experimental results indicate that our
method of using a length probability model outperforms that of using PCFG.
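
For comparison, a maximum likelihood estimate of PCFG parameters from hand-selected trees is
usually taken to be the relative frequency of each rule among the rules with the same left-hand
side. The sketch below assumes that standard formulation and scores a tree by the product of its
rule probabilities; the text does not spell out these details, so the input format, the function
names, and the toy counts are assumptions.

```python
from collections import Counter


def estimate_pcfg(rule_applications):
    """Relative-frequency estimate of P(lhs -> rhs | lhs).

    `rule_applications` is a hypothetical list of (lhs, rhs) pairs,
    one per rule application in the training trees.
    """
    rule_counts = Counter(rule_applications)
    lhs_counts = Counter(lhs for lhs, _ in rule_applications)
    return {(lhs, rhs): c / lhs_counts[lhs] for (lhs, rhs), c in rule_counts.items()}


def tree_probability(rules_used, pcfg):
    """Probability of a tree as the product of the probabilities of its rules."""
    p = 1.0
    for rule in rules_used:
        p *= pcfg.get(rule, 0.0)  # unseen rules get probability 0
    return p


# Toy counts (invented for illustration).
applications = (
    [("NP", ("NP", "PP"))] * 1
    + [("NP", ("DT", "NN"))] * 3
    + [("VP", ("VP", "PP"))] * 1
    + [("VP", ("V", "NP"))] * 1
)
pcfg = estimate_pcfg(applications)
print(pcfg[("NP", ("NP", "PP"))])  # 0.25
print(tree_probability([("NP", ("NP", "PP")), ("VP", ("V", "NP"))], pcfg))  # 0.125
```

Unlike the length probability, this score depends only on which rules are applied, not on the
lengths of the phrases involved.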

In another experiment, instead of the back-off method, we used the product of lexical likelihood
values and syntactic likelihood values to rank interpretations. For the lexical likelihood, we
used the value calculated from three-word probabilities, provided that it was not 0; otherwise we
used the value calculated from two-word probabilities. When the preference values of all of the
interpretations obtained were calculated as 0, we ranked the interpretations at random. Table 1
represents this result as 'Lex3(Lex2)xSyn.' Our results indicate that it is preferable to employ
the back-off method.
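
The product-based alternative described above can be sketched as follows. As before, this is an
illustration under assumptions: the dictionary keys 'lex3', 'lex2' and 'syn', the function names,
and the toy values are not taken from the experiments.

```python
import random


def product_score(lex3, lex2, syn):
    """Lexical likelihood (three-word if non-zero, else two-word) times syntactic likelihood."""
    lexical = lex3 if lex3 != 0 else lex2
    return lexical * syn


def rank_by_product(interpretations, rng=None):
    """Rank by the product score; rank at random if every score is 0."""
    rng = rng or random.Random()
    scored = [(product_score(i["lex3"], i["lex2"], i["syn"]), i) for i in interpretations]
    if all(score == 0 for score, _ in scored):
        shuffled = list(interpretations)
        rng.shuffle(shuffled)
        return shuffled
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [interpretation for _, interpretation in ordered]


# Toy values (invented for illustration).
a = {"name": "A", "lex3": 0.0, "lex2": 0.2, "syn": 0.5}  # score 0.2 * 0.5
b = {"name": "B", "lex3": 0.1, "lex2": 0.9, "syn": 0.3}  # score 0.1 * 0.3
print([i["name"] for i in rank_by_product([b, a])])  # ['A', 'B']
```

In this toy case the product prefers A, whereas a back-off comparison would prefer B because B has
the larger three-word lexical likelihood; the two schemes can therefore disagree on the same
input.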



7 Concluding Remarks 

We have proposed a probabilistic method of disambiguation based on psycholinguistic principles. 
Our main proposals are: (a) to unify the psycholinguistic approach and the probabilistic approach, 
specifically, to implement psycholinguistic principles on the basis of probabilistic methodology, (b)
to use the notion of 'length' in defining a probabilistic model for the implementation of RAP and 
ALPP, and (c) to employ the back-off method to combine the use of lexical likelihood with that of 
syntactic likelihood. Our experimental results indicate that our method is quite effective. 